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Abstract 

The standard pipeline approach to se- 
mantic processing, in which sentences 
are morphologically and syntactically 
resolved to a single tree before they 
are interpreted, is a poor fit for ap- 
plications such as natural language in- 
terfaces. This is because the environ- 
ment information, in the form of the ob- 
jects and events in the application's run- 
time environment, cannot be used to in- 
form parsing decisions unless the in- 
put sentence is semantically analyzed, 
but this does not occur until after pars- 
ing in the single-tree semantic architec- 
ture. This paper describes the compu- 
tational properties of an alternative ar- 
chitecture, in which semantic analysis 
is performed on all possible interpre- 
tations during parsing, in polynomial 
time. 

1 Introduction 

Shallow semantic processing applications, com- 
paring argument structures to search patterns 
or filling in simple templates, can achieve re- 
spectable results using the standard 'pipeline' ap- 
proach to semantics, in which sentences are mor- 
phologically and syntactically resolved to a single 
tree before being interpreted. Putting disambigua- 
tion ahead of semantic evaluation is reasonable in 
these applications because they are primarily run 
on content like newspaper text or dictated speech, 
where no machine-readable contextual informa- 
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tion is readily available to provide semantic guid- 
ance for disambiguation. 

This single-tree semantic architecture is a poor 
fit for applications such as natural language inter- 
faces however, in which a large amount of contex- 
tual information is available in the form of the ob- 
jects and events in the application's run-time en- 
vironment. This is because the environment infor- 
mation cannot be used to inform parsing and dis- 
ambiguation decisions unless the input sentence 
is semantically analyzed, but this does not occur 
until after parsing in the single-tree architecture. 
Assuming that no current statistical disambigua- 
tion technique is so accurate that it could not ben- 
efit from this kind of environment-based informa- 
tion (if available), then it is important that the se- 
mantic analysis in an interface architecture be ef- 
ficiently performed during parsing. 

This paper describes the computational prop- 
erties of one such architecture, embedded within 
a system for giving various kinds of conditional 
instructions and behavioral constraints to virtual 
human agents in a 3-D simulated environment 
(Bindiganavale et al., 2000). In one application 



of this system, users direct simulated maintenance 
personnel to repair a jet engine, in order to ensure 
that the maintenance procedures do not risk the 
safety of the people performing them. Since it is 
expected to process a broad range of maintenance 
instructions, the parser is run on a large subset 
of the Xtag English grammar (X TAG Research 
Group, 1998), which has been annotated with lex- 
ical semantic classes ( Kipper et al., 2000 ) associ- 
ated with the objects, states, and processes in the 
maintenance simulation. Since the grammar has 
several thousand lexical entries, the parser is ex- 
posed to considerable lexical and structural ambi- 
guity as a matter of course. 

The environment-based disambiguation archi- 
tecture described in this paper has much in 
common with very early environment-based ap- 
proaches, such as those described by Winograd 



( Winograd, 1972 ), in that it uses the actual en- 
tities in an environment database to resolve am- 
biguity in the input. This research explores two 
extensions to the basic approach however: 

1. It incorporates ideas from type theory to rep- 
resent a broad range of linguistic phenomena 
in a manner for which their extensions or po- 
tential referents in the environment are well- 
defined in every case. This is elaborated in 
Section |[ 

2. It adapts the concept of structure sharing, 
taken from the study of parsing, not only to 
translate the many possible interpretations of 
ambiguous sentences into shared logical ex- 
pressions, but also to evaluate these sets of 
potential referents, over all possible interpre- 
tations, in polynomial time. This is elabo- 
rated in Section |3[ 

Taken together, these extensions allow interfaced 
systems to evaluate a broad range of natural lan- 
guage inputs - including those containing NPA'^P 
attachment ambiguity and verb sense ambiguity 
- in a principled way, simply based on the ob- 
jects and events in the systems' environments. 
For example, such a system would be able to cor- 
rectly answer 'Did someone stop the test at 3:00?' 
and resolve the ambiguity in the attachment of 'at 
3:00' just from the fact that there aren't any 3:00 
tests in the environment, only an event where one 
stops at 3:00.0 Because it evaluates instructions 
before attempting to choose a single interpreta- 
tion, the interpreter can avoid getting 'stranded' 
by disambiguation errors in earlier phases of anal- 
ysis. 

The main challenge of this approach is that it 
requires the efficient calculation of the set of ob- 
jects, states, or processes in the environment that 
each possible sub-derivation of an input sentence 
could refer to. A semantic interpreter could al- 
ways be run on an (exponential) enumerated set 
of possible parse trees as a post-process, to fil- 
ter out those interpretations which have no en- 
vironment referents, but recomputing the poten- 
tial environment referents for every tree would re- 
quire an enormous amount of time (particularly 



It is important to make a distinction between this envi- 
ronment information, wliicii just describes the set of objects 
and events that exist in the interfaced application, and what 
is often called domain information, which describes (usually 
via hand-written rules) the kinds of objects and events can 
exist in the interfaced application. The former comes for free 
with the application, while the latter can be very expensive 
to create and port between domains. 



for broad coverage grammars such as the one em- 
ployed here). The primary result of this paper is 
therefore a method of containing the time com- 
plexity of these calculations to lie within the com- 
plexity of parsing (i.e. within 0{v?) for a context- 
free grammar, where n is the number of words 
in the input sentence), without sacrificing logi- 
cal correctness, in order to make environment- 
based interpretation tractable for interactive appli- 
cations. 

2 Representation of referents 

Existing environment-based methods (such as 
those proposed by Winograd) only calculate the 
referents of noun phrases, so they only consult the 
objects in an environment database when inter- 
preting input sentences. But the evaluation of am- 
biguous sentences will be incomplete if the refer- 
ents for verb phrases and other predicates are not 
calculated. In order to evaluate the possible inter- 
pretations of a sentence, as described in the previ- 
ous section, an interface needs to define referent 
sets for every possible constituent]^ 

The proposed solution draws on a theory of 
constituent types from formal linguistic seman- 
tics, in which constituents such as nouns and verb 
phrases are represented as composeable functions 
that take entitiess or situations as inputs and ulti- 
mately return a truth value for the sentence. Fol- 
lowing a straightforward adaptation of standard 
type theory, common nouns (functions from en- 
tities to truth values) define potential referent sets 
of simple environment entities: {ei, 62, 63, . . .}, 
and sentences (functions from situations or world 
states to truth values) define potential referent sets 
of situations in which those sentences hold true: 
{(Ji, (J2, o"3, . . .}. Depending on the needs of the 
application, these situations can be represented 
as intervals along a time line (Allen and Fergu- 
son, 1994), or as regions in a three-dimensional 
space ( Xu and Badler, 20001 ), or as some com- 
bination of the two, so that they can be con- 
strained by modifiers that specify the situations' 
times and locations. Referents for other types 
of phrases may be expressed as tuples of enti- 
ties and situations: one for each argument of the 

^This is not strictly true, as referent sets for constituents 
like determiners are difficult to define, and others (particu- 
larly those of quantifiers) will be extremely large until com- 
posed with modifiers and arguments. Fortunately, as long 
as there is a bound on the height in the tree to which the 
evaluation of referent sets can be deferred (e.g. after the first 
composition), the claimed polynomial complexity of refer- 
ent annotation will not be lost. 



corresponding logical function's input (with the 
presence or absence of the tuple representing the 
boolean output). For example, adjectives, prepo- 
sitional phrases, and relative clauses, which are 
typically represented as situationally-dependent 
properties (functions from situations and entities 
to truth values) define potential referent sets of tu- 
ples that consist of one entity and one situation: 
{(ei,cri), (62,0-2), (63,0-3), . . .}. This represen- 
tation can be extended to treat common nouns 
as situationally-dependent properties as well, in 
order to handle sets like 'bachelors' that change 
their membership over time. 

3 Sharing referents across 
interpretations 

Any method for using the environment to guide 
the interpretation of natural language sentences 
requires a tractable representation of the many 
possible interpretations of each input. The 
representation described here is based on the 
polynomial-sized chart produced by any dynamic 
programming recognition algorithm. 

A record of the derivation paths in any dy- 
namic programming recognition algorithm (such 
as C KY dCocke and Schwartz, 197^; Kasam i, 



1965; [Younger, 1967D or Barley ( parley, 1970D ) 
can be interpreted as a polynomial sized and- 
or graph with space complexity equal to the 
time complexity of recognition, whose disjunc- 
tive nodes represent possible constituents in the 
analysis, and whose conjunctive nodes represent 
binary applications of rules in the grammar. This 
is called a shared forest of parse trees, because it 
can represent an exponential number of possible 
parses using a polynomial number of nodes which 
are s hared between alternative analyses ( Tomita, 
1985; [Billot and Lang, 1989D , and caiTbe con- 
structed and traversed in time of the same com- 
plexity (e.g. 0{n^) for context free grammars). 
For example, the two parse trees for the noun 
phrase 'button on handle beside adapter' shown 
in Figure |l| can be merged into the single shared 
forest in Figure ^ without any loss of information. 

These shared syntactic structures can further 
be associated with compositional semantic func- 
tions that correspond to the syntactic elements 
in the forest, to create a shared forest of trees 
each representing a complete expression in some 
logical form. This extended sharing is similar 
to the 'packing' approach employed in the Core 
Language Engine ( Alshawi, 1992 ), except that 
the CLE relies on a quasi-logical form to under- 



specify semantic information such as quantifier 
scope (the calculation of which is deferred un- 
til syntactic ambiguities have been at least par- 
tially resolved by other means); whereas the ap- 
proach described here extends structure sharing to 
incorporate a certain amount of quantifier scope 
ambiguity in order to allow a complete eval- 
uation of all subderivations in a shared forest 
before making any disambiguation decisions in 
syntax.^ Various synchronous formalisms have 
been introduced for associating syntactic repre- 
sentations with logical functions in isomorphic 
or locally non-isomorphic derivations, includ- 
ing Categorial Grammars (CGs) ( Wood, 1993 ), 
Synchronous Tree Adjoining Grammars (TAGs) 
( [Joshi, 1985[ ; [Shieber and Schabes, 1990| ; Shieber, 
1994), and Synchronous Description Tree Gram- 
mars (DTGs) ( Rambow et al., 1995 ; Rambow and 
Satta, 1996). Most of these formalisms can be ex- 
tended to define semantic associations over entire 
shared forests, rather than merely over individual 
parse trees, in a straightforward manner, preserv- 
ing the ambiguity of the syntactic forest without 
exceeding its polynomial size, or the polynomial 
time complexity of creating or traversing it. 

Since one of the goals of this architecture is 
to use the system's representation of its environ- 
ment to resolve ambiguity in its instructions, a 
space-efficient shared forest of logical functions 
will not be enough. The system must also be able 
to efficiently calculate the sets of potential refer- 
ents in the environment for every subexpression in 
this forest. Fortunately, since the logical function 
forest shares structure between alternative anal- 
yses, many of the sets of potential referents can 
be shared between analyses during evaluation as 
well. This has the effect of building a third shared 
forest of potential referent sets (another and-or 
graph, isomorphic to the logical function forest 
and with the same polynomial complexity), where 
every conjunctive node represents the results of 
applying a logical function to the elements in that 
node's child sets, and every disjunctive node rep- 
resents the union of all the potential referents in 
that node's child sets. The presence or absence 
of these environment referents at various nodes 
in the shared forest can be used to choose a vi- 
able parse tree from the forest, or to evaluate the 
truth or falsity of the input sentence without dis- 



A similar basis on (at least partially) disambiguated syn- 
tactic representations makes similar un derspecified semantic 
representations such as hole semantics ( Bos, 19951 ) ill-suited 
for environment-based syntactic disambiguation. 





Figure 1: Example parse trees for 'button on handle beside adapter' 




Figure 2: Example shared forest for "button on handle beside adapter" 



ambiguating it (by checking the presence or lack 
of referents at the root of the forest). 

For example, the noun phrase 'button on han- 
dle beside adapter' has at least two possible in- 
terpretations, represented by the two trees in Fig- 
ure |l|: one in which a button is on a handle and 
the handle (but not necessarily the button) is be- 
side an adapter, and the other in which a button is 
on a handle and the button (but not necessarily the 
handle) is beside an adapter. The semantic func- 
tions are annotated just below the syntactic cat- 
egories, and the potential environment referents 
are annotated just below the semantic functions 
in the figure. Because there are no handles next 
to adapters in the environment (only buttons next 
to adapters), the first interpretation has no envi- 
ronment referents at its root, so this analysis is 
dispreferred if it occurs within the analysis of a 
larger sentence. The second interpretation does 
have potential environment referents all the way 
up to the root (there is a button on a handle which 
is also beside an adapter), so this analysis is pre- 
ferred if it occurs within the analysis of a larger 
sentence. 

The shared forest representation effectively 
merges the enumerated set of parse trees into a 
single data structure, and unions the referent sets 
of the nodes in these trees that have the same la- 
bel and cover the same span in the string yield 
(such as the root node, leaves, and the PP cover- 



ing 'beside adapter' in the examples above). The 
referent-annotated forest for this sentence there- 
fore looks like the forest in Figure ^, in which the 
sets of buttons, handles, and adapters, as well as 
the set of things beside adapters, are shared be- 
tween the two alternative interpretations. If there 
is a button next to an adapter, but no handle next 
to an adapter, the tree representing 'handle beside 
adapter' as a constituent may be dispreferred in 
disambiguation, but the NP constituent at the root 
is still preferred because it has potential referents 
in the environment due to the other interpretation. 

The logical function at each node is defined 
over the referent sets of that node's immediate 
children. Nodes that represent the attachment of 
a modifier with referent set M to a relation with 
referent set R produce referent sets of the form: 

{ (x, ...) I (x, ...)Gi? A (x)gM} 

Nodes in a logical function forest that represent 
the attachment of an argument with referent set A 
to a relation with referent set R produce referent 
sets of the form: 

{(...) I 3x.{...,x)eR A {x)eA} 

effectively stripping off one of the objects in each 
tuple if the object is also found in the set of refer- 
ents for the argument.^ This is a direct application 

■*In order to show where the referents came from, the tu- 



VP[drained] 

drain' {x) 
















PP[after] ^ 

after" {x, y) 




VP[dramed] 


P[after] 


drain' [x) 


after' {x,y) 


{{c7i,ei), {(T2,e2>} 





NP[test] 

test'{x) 
{Ti,T2,r3} 




PP[at] 

at"(x.,y) 

{(<7l,tl),(pi,tl)} 



P[at] 

at'(x.,y) 



NP[3:00] 
constant ti 



Figure 3: Example shared forest for verb phrase "drained after test at 3:00" 



of standard type theory to the calculation of ref- 
erent sets: modifiers take and return functions of 
the same type, and arguments must satisfy one of 
the input types of an applied function. 

Since both of these 'referent set composition' 
operations at the conjunctive nodes - as well as 
the union operation at the disjunctive nodes - are 
linear in space and time on the number of ele- 
ments in each of the composed sets (assuming the 
sets are sorted in advance and remain so), the cal- 
culation of referent sets only adds a factor of |f | 
to the size complexity of the forest and the time 
complexity of processing it, where |<f | is the num- 
ber of objects and events in the run- time environ- 
ment. Thus, the total space and time complexity 
of the above algorithm (on a context-free forest) is 
0{\£\n^). If other operations are added, the com- 
plexity of referent set composition will be limited 
by the least efficient operation. 

3.1 Temporal referents 

Since the referent sets for situations are also well 
defined under type theory, this environment-based 
approach can also resolve attachment ambigui- 



ple objects are not stripped off in Figures |i| and ^. Instead, 
an additional bar is added to the function name to designate 
the effective last object in each tuple: the tuple (6i , a\) ref- 
erenced by beside has ai as the last element, but the tuple 
referenced by beside" actually has bi as the last element 
since the complement a\ has been already been attached. 



ties involving verbs and verb phrases in addition 
to those involving only nominal referents. For 
example, if the interpreter is given the sentence 
"Coolant drained after test at 3:00," which could 
mean the draining was at 3:00 or the test was at 
3:00, the referents for the draining process and 
the testing process can be treated as time intervals 
in the environment history.^ First, a forest is con- 
structed which shares the subtrees for "the test" 
and "after 3:00," and the corresponding sets of 
referents. Each node in this forest (shown in Fig- 
ure ^) is then annotated with the set of objects and 
intervals that it could refer to in the environment. 
Since there were no testing intervals at 3:00 in the 
environment, the referent set for the NP 'test after 
3:00' is evaluated to the null set. But since there 
is an interval corresponding to a draining process 
(cji) at the root, the whole VP will still be pre- 
ferred as constituent due to the other interpreta- 
tion. 

3.2 Quantifier scoping 

The evaluation of referents for quantifiers also 
presents a tractability problem, because the func- 
tions they correspond to in the Montague analy- 
sis map two sets of entities to a truth value. This 



The composition of time intervals, as well as spatial re- 
gions and other types of situational referents, is more com- 
plex than that outlined for objects, but space does not permit 
a complete explanation. 



means that a straightforward representation of the 
potential referents of a quantifier such as 'at least 
one' would contain every pair of non-empty sub- 
sets of the set f of all entities, with a cardinal- 
ity on the order of 2^I^L If the evaluation of ref- 
erents is deferred until quantifiers are composed 
with the common nouns they quantify over, the 
input sets would still be as large as the power sets 
of the nouns' potential referents. Only if the eval- 
uation of referents is deferred until complete NPs 
are composed as arguments (as subjects or objects 
of verbs, for example) can the output sets be re- 
stricted to a tractable size. 

This provision only covers in situ quantifier 
scopings, however. In order to model raised scop- 
ings, arbitrarily long chains of raised quantifiers 
(if there are more than one) would have to be eval- 
uated before they are attached to the verb, as they 
are in a CCG -style functi on composition analy- 
sis of raising ( Park, 1996 ).p| Fortunately, univer- 
sal quantifiers like 'each' and 'every' only choose 
the one maximal set of referents out of all the pos- 
sible subsets in the power set, so any number of 
raised universal quantifier functions can be com- 
posed into a single function whose referent set 
would be no larger than the set of all possible en- 
tities. 

It may not be possible to evaluate the poten- 
tial referents of non-universal raised quantifiers 
in polynomial time, because the number of po- 
tential subsets they take as input is on the or- 
der of the power set of the noun's potential ref- 
erents. This apparent failure may hold some ex- 
planatory power, however, since raised quantifiers 
other than 'each' and 'every' seem to be exceed- 
ingly rare in the data. This scarcity may be a re- 
sult of the significant computational complexity 
of evaluating them in isolation (before they are 
composed with a verb). 

4 Evaluation 

An implemented system incorporating this 
environment-based approach to disambiguation 
has been tested on a set of manufacturer- 
supplied aircraft maintenance instructions, using 
a computer-aided design (CAD) model of a 
portion of the aircraft as the environment. It 
contains several hundred three dimensional 



This approach is in some sense wedded to a CCG-style 
syntacto-semantic analysis of quantifier raising, inasmuch as 
its syntactic and semantic structures must be isomorphic in 
order to preserve the polynomial complexity of the shared 
forest. 



objects (buttons, handles, sliding couplings, etc), 
labeled with object type keywords and connected 
to other objects through joints with varying 
degrees of freedom (indicating how each object 
can be rotated and translated with respect to other 
objects in the environment). 

The test sentences were the manufacturer's in- 
structions for replacing a piece of equipment in 
this environment. The baseline grammar was not 
altered to fit the test sentences or the environment, 
but the labeled objects in the CAD model were 
automatically added to the lexicon as common 
nouns. 

In this preliminary accuracy test, forest nodes 
that correspond to noun phrase or modifier cate- 
gories are dispreferred if they have no potential 
entity referents, and forest nodes corresponding 
to other categories are dispreferred if their argu- 
ments have no potential entity referents. Many 
of the nodes in the forest correspond to noun- 
noun modifications, which cannot be ruled out by 
the grammar because the composition operation 
that generates them seems to be productive (vir- 
tually any 'N2' that is attached to or contained in 
an 'Nl' can be an 'Nl N2'). Potential referents 
for noun-noun modifications are calculated by a 
rudimentary spatial proximity threshold, such that 
any potential referent of the modified noun lying 
within the threshold distance of a potential ref- 
erent of the modifier noun in the environment is 
added to the composed set. 

The results are shown below. The average num- 
ber of parse trees per sentence in this set was 
21 before disambiguation. The average ratio of 
nodes in enumerated tree sets to nodes in shared 
forests for the instructions in this test set was 
9.6 : 1, a nearly tenfold reduction due to sharing. 

Gold standard 'correct' trees were annotated 
by hand using the same grammar that the parser 
uses. The success rate of the parser in this do- 
main (the rate at which the correct tree could be 
found in the parse forest) was 87%. The reten- 
tion rate of the environment-based filtering mech- 
anism described above (the rate at which the cor- 
rect tree was retained in parse forest) was 92% 
of successfully parsed sentences. The average 
reduction in number of possible parse trees due 
to the environment-based filtering mechanism de- 
scribed above was 3.6 : 1 for successfully parsed 
and filtered forests]] 



'Sample parse forests and other details of 
this application and environment are available at 

ittp : / / www . cis . upenn . edu/ ~schuler / ebd . html 
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5 Conclusion 

This paper has described a method by which the 
potential environment referents for all possible in- 
terpretations of of an input sentence can be evalu- 
ated during parsing, in polynomial time. The ar- 
chitecture described in this paper has been imple- 
mented with a large coverage grammar as a run- 
time interface to a virtual human simulation. It 
demonstrates that a natural language interface ar- 
chitecture that uses the objects and events in an 
application's run-time environment to inform dis- 
ambiguation decisions (by performing semantic 
evaluation during parsing) is feasible for interac- 
tive appUcations. 
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